{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"center\">\n",
    "<h2>Evaluating RAG</h2>\n",
    "</div>\n",
    "\n",
    "<div align=\"center\">\n",
    "    <h3 ><a href=\"https://aiengineering.academy/\" target=\"_blank\">AI Engineering.academy</a></h3>\n",
    "    \n",
    "    \n",
    "</div>\n",
    "\n",
    "<div align=\"center\">\n",
    "<a href=\"https://aiengineering.academy/\" target=\"_blank\">\n",
    "<img src=\"https://raw.githubusercontent.com/adithya-s-k/AI-Engineering.academy/main/assets/banner.png\" alt=\"Ai Engineering. Academy\" width=\"50%\">\n",
    "</a>\n",
    "</div>\n",
    "\n",
    "\n",
    "<div align=\"center\">\n",
    "\n",
    "[![GitHub Stars](https://img.shields.io/github/stars/adithya-s-k/AI-Engineering.academy?style=social)](https://github.com/adithya-s-k/AI-Engineering.academy/stargazers)\n",
    "[![GitHub Forks](https://img.shields.io/github/forks/adithya-s-k/AI-Engineering.academy?style=social)](https://github.com/adithya-s-k/AI-Engineering.academy/network/members)\n",
    "[![GitHub Issues](https://img.shields.io/github/issues/adithya-s-k/AI-Engineering.academy)](https://github.com/adithya-s-k/AI-Engineering.academy/issues)\n",
    "[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/adithya-s-k/AI-Engineering.academy)](https://github.com/adithya-s-k/AI-Engineering.academy/pulls)\n",
    "[![License](https://img.shields.io/github/license/adithya-s-k/AI-Engineering.academy)](https://github.com/adithya-s-k/AI-Engineering.academy/blob/main/LICENSE)\n",
    "\n",
    "</div>\n",
    "\n",
    "## Introduction\n",
    "\n",
    "Evaluation is a critical component in the development and optimization of Retrieval-Augmented Generation (RAG) systems. It involves assessing the performance, accuracy, and quality of various aspects of the RAG pipeline, from retrieval effectiveness to the relevance and faithfulness of generated responses.\n",
    "\n",
    "## Importance of Evaluation in RAG\n",
    "\n",
    "Effective evaluation of RAG systems is essential because it:\n",
    "1. Helps identify strengths and weaknesses in the retrieval and generation processes.\n",
    "2. Guides improvements and optimizations across the RAG pipeline.\n",
    "3. Ensures the system meets quality standards and user expectations.\n",
    "4. Facilitates comparison between different RAG implementations or configurations.\n",
    "5. Helps detect issues such as hallucinations, biases, or irrelevant responses.\n",
    "\n",
    "\n",
    "## Key Evaluation Metrics\n",
    "\n",
    "### RAGAS Metrics\n",
    "1. **Faithfulness**: Measures how well the generated response aligns with the retrieved context.\n",
    "2. **Answer Relevancy**: Assesses the relevance of the response to the query.\n",
    "3. **Context Recall**: Evaluates how well the retrieved chunks cover the information needed to answer the query.\n",
    "4. **Context Precision**: Measures the proportion of relevant information in the retrieved chunks.\n",
    "5. **Context Utilization**: Assesses how effectively the generated response uses the provided context.\n",
    "6. **Context Entity Recall**: Evaluates the coverage of important entities from the context in the response.\n",
    "7. **Noise Sensitivity**: Measures the system's robustness to irrelevant or noisy information.\n",
    "8. **Summarization Score**: Assesses the quality of summarization in the response.\n",
    "\n",
    "### DeepEval Metrics\n",
    "1. **G-Eval**: A general evaluation metric for text generation tasks.\n",
    "2. **Summarization**: Assesses the quality of text summarization.\n",
    "3. **Answer Relevancy**: Measures how well the response answers the query.\n",
    "4. **Faithfulness**: Evaluates the accuracy of the response with respect to the source information.\n",
    "5. **Contextual Recall and Precision**: Measures the effectiveness of context retrieval.\n",
    "6. **Hallucination**: Detects fabricated or inaccurate information in the response.\n",
    "7. **Toxicity**: Identifies harmful or offensive content in the response.\n",
    "8. **Bias**: Detects unfair prejudice or favoritism in the generated content.\n",
    "\n",
    "### Trulens Metrics\n",
    "1. **Context Relevance**: Assesses how well the retrieved context matches the query.\n",
    "2. **Groundedness**: Measures how well the response is supported by the retrieved information.\n",
    "3. **Answer Relevance**: Evaluates how well the response addresses the query.\n",
    "4. **Comprehensiveness**: Assesses the completeness of the response.\n",
    "5. **Harmful/Toxic Language**: Identifies potentially offensive or dangerous content.\n",
    "6. **User Sentiment**: Analyzes the emotional tone of user interactions.\n",
    "7. **Language Mismatch**: Detects inconsistencies in language use between query and response.\n",
    "8. **Fairness and Bias**: Evaluates the system for equitable treatment across different groups.\n",
    "9. **Custom Feedback Functions**: Allows for tailored evaluation metrics specific to use cases.\n",
    "\n",
    "## Best Practices for RAG Evaluation\n",
    "\n",
    "1. **Comprehensive Evaluation**: Use a combination of metrics to assess different aspects of the RAG system.\n",
    "2. **Regular Benchmarking**: Continuously evaluate the system as changes are made to the pipeline.\n",
    "3. **Human-in-the-Loop**: Incorporate human evaluation alongside automated metrics for a holistic assessment.\n",
    "4. **Domain-Specific Metrics**: Develop custom metrics relevant to your specific use case or domain.\n",
    "5. **Error Analysis**: Investigate patterns in low-scoring responses to identify areas for improvement.\n",
    "6. **Comparative Evaluation**: Benchmark your RAG system against baseline models and alternative implementations.\n",
    "\n",
    "## Conclusion\n",
    "\n",
    "A robust evaluation framework is crucial for developing and maintaining high-quality RAG systems. By leveraging a diverse set of metrics and following best practices, developers can ensure their RAG systems deliver accurate, relevant, and trustworthy responses while continuously improving performance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting up the Environment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install llama-index\n",
    "!pip install llama-index-vector-stores-qdrant \n",
    "!pip install llama-index-readers-file \n",
    "!pip install llama-index-embeddings-fastembed \n",
    "!pip install llama-index-llms-openai\n",
    "!pip install llama-index-llms-groq\n",
    "!pip install -U qdrant_client fastembed\n",
    "!pip install python-dotenv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Standard library imports\n",
    "import logging\n",
    "import sys\n",
    "import os\n",
    "\n",
    "# Third-party imports\n",
    "from dotenv import load_dotenv\n",
    "from IPython.display import Markdown, display\n",
    "\n",
    "# Qdrant client import\n",
    "import qdrant_client\n",
    "\n",
    "# LlamaIndex core imports\n",
    "from llama_index.core import VectorStoreIndex, SimpleDirectoryReader\n",
    "from llama_index.core import Settings\n",
    "\n",
    "# LlamaIndex vector store import\n",
    "from llama_index.vector_stores.qdrant import QdrantVectorStore\n",
    "\n",
    "# Embedding model imports\n",
    "from llama_index.embeddings.fastembed import FastEmbedEmbedding\n",
    "from llama_index.embeddings.openai import OpenAIEmbedding\n",
    "\n",
    "# LLM import\n",
    "from llama_index.llms.openai import OpenAI\n",
    "# from llama_index.llms.groq import Groq\n",
    "# Load environment variables\n",
    "load_dotenv()\n",
    "\n",
    "# Get OpenAI API key from environment variables\n",
    "OPENAI_API_KEY = os.getenv(\"OPENAI_API_KEY\")\n",
    "GROK_API_KEY = os.getenv(\"GROQ_API_KEY\")\n",
    "\n",
    "# Setting up Base LLM\n",
    "Settings.llm = OpenAI(\n",
    "    model=\"gpt-4o-mini\", temperature=0.1, max_tokens=1024, streaming=True\n",
    ")\n",
    "\n",
    "# Settings.llm = Groq(model=\"llama3-70b-8192\" , api_key=GROK_API_KEY)\n",
    "\n",
    "# Set the embedding model\n",
    "# Option 1: Use FastEmbed with BAAI/bge-base-en-v1.5 model (default)\n",
    "# Settings.embed_model = FastEmbedEmbedding(model_name=\"BAAI/bge-base-en-v1.5\")\n",
    "\n",
    "# Option 2: Use OpenAI's embedding model (commented out)\n",
    "# If you want to use OpenAI's embedding model, uncomment the following line:\n",
    "Settings.embed_model = OpenAIEmbedding(embed_batch_size=10, api_key=OPENAI_API_KEY)\n",
    "\n",
    "# Qdrant configuration (commented out)\n",
    "# If you're using Qdrant, uncomment and set these variables:\n",
    "# QDRANT_CLOUD_ENDPOINT = os.getenv(\"QDRANT_CLOUD_ENDPOINT\")\n",
    "# QDRANT_API_KEY = os.getenv(\"QDRANT_API_KEY\")\n",
    "\n",
    "# Note: Remember to add QDRANT_CLOUD_ENDPOINT and QDRANT_API_KEY to your .env file if using Qdrant Hosted version"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# lets loading the documents using SimpleDirectoryReader\n",
    "\n",
    "print(\"🔃 Loading Data\")\n",
    "\n",
    "from llama_index.core import Document\n",
    "reader = SimpleDirectoryReader(\"../data/\" , recursive=True)\n",
    "documents = reader.load_data(show_progress=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting up Vector Database\n",
    "\n",
    "We will be using qDrant as the Vector database\n",
    "There are 4 ways to initialize qdrant \n",
    "\n",
    "1. Inmemory\n",
    "```python\n",
    "client = qdrant_client.QdrantClient(location=\":memory:\")\n",
    "```\n",
    "2. Disk\n",
    "```python\n",
    "client = qdrant_client.QdrantClient(path=\"./data\")\n",
    "```\n",
    "3. Self hosted or Docker\n",
    "```python\n",
    "\n",
    "client = qdrant_client.QdrantClient(\n",
    "    # url=\"http://<host>:<port>\"\n",
    "    host=\"localhost\",port=6333\n",
    ")\n",
    "```\n",
    "\n",
    "4. Qdrant cloud\n",
    "```python\n",
    "client = qdrant_client.QdrantClient(\n",
    "    url=QDRANT_CLOUD_ENDPOINT,\n",
    "    api_key=QDRANT_API_KEY,\n",
    ")\n",
    "```\n",
    "\n",
    "for this notebook we will be using qdrant cloud"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# creating a qdrant client instance\n",
    "\n",
    "client = qdrant_client.QdrantClient(\n",
    "    # you can use :memory: mode for fast and light-weight experiments,\n",
    "    # it does not require to have Qdrant deployed anywhere\n",
    "    # but requires qdrant-client >= 1.1.1\n",
    "    # location=\":memory:\"\n",
    "    # otherwise set Qdrant instance address with:\n",
    "    # url=QDRANT_CLOUD_ENDPOINT,\n",
    "    # otherwise set Qdrant instance with host and port:\n",
    "    host=\"localhost\",\n",
    "    port=6333\n",
    "    # set API KEY for Qdrant Cloud\n",
    "    # api_key=QDRANT_API_KEY,\n",
    "    # path=\"./db/\"\n",
    ")\n",
    "\n",
    "vector_store = QdrantVectorStore(client=client, collection_name=\"01_RAG_Evaluation\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ingest Data into vector DB"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## ingesting data into vector database\n",
    "\n",
    "## lets set up an ingestion pipeline\n",
    "\n",
    "from llama_index.core.node_parser import TokenTextSplitter\n",
    "from llama_index.core.node_parser import SentenceSplitter\n",
    "from llama_index.core.node_parser import MarkdownNodeParser\n",
    "from llama_index.core.node_parser import SemanticSplitterNodeParser\n",
    "from llama_index.core.ingestion import IngestionPipeline\n",
    "\n",
    "pipeline = IngestionPipeline(\n",
    "    transformations=[\n",
    "        # MarkdownNodeParser(include_metadata=True),\n",
    "        # TokenTextSplitter(chunk_size=500, chunk_overlap=20),\n",
    "        SentenceSplitter(chunk_size=1024, chunk_overlap=20),\n",
    "        # SemanticSplitterNodeParser(buffer_size=1, breakpoint_percentile_threshold=95 , embed_model=Settings.embed_model),\n",
    "        Settings.embed_model,\n",
    "    ],\n",
    "    vector_store=vector_store,\n",
    ")\n",
    "\n",
    "# Ingest directly into a vector db\n",
    "nodes = pipeline.run(documents=documents , show_progress=True)\n",
    "print(\"Number of chunks added to vector DB :\",len(nodes))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting Up Index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "index = VectorStoreIndex.from_vector_store(vector_store=vector_store)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Modifying Prompts and Prompt Tuning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import ChatPromptTemplate\n",
    "\n",
    "qa_prompt_str = (\n",
    "    \"Context information is below.\\n\"\n",
    "    \"---------------------\\n\"\n",
    "    \"{context_str}\\n\"\n",
    "    \"---------------------\\n\"\n",
    "    \"Given the context information and not prior knowledge, \"\n",
    "    \"answer the question: {query_str}\\n\"\n",
    ")\n",
    "\n",
    "refine_prompt_str = (\n",
    "    \"We have the opportunity to refine the original answer \"\n",
    "    \"(only if needed) with some more context below.\\n\"\n",
    "    \"------------\\n\"\n",
    "    \"{context_msg}\\n\"\n",
    "    \"------------\\n\"\n",
    "    \"Given the new context, refine the original answer to better \"\n",
    "    \"answer the question: {query_str}. \"\n",
    "    \"If the context isn't useful, output the original answer again.\\n\"\n",
    "    \"Original Answer: {existing_answer}\"\n",
    ")\n",
    "\n",
    "# Text QA Prompt\n",
    "chat_text_qa_msgs = [\n",
    "    (\"system\",\"You are a AI assistant who is well versed with answering questions from the provided context\"),\n",
    "    (\"user\", qa_prompt_str),\n",
    "]\n",
    "text_qa_template = ChatPromptTemplate.from_messages(chat_text_qa_msgs)\n",
    "\n",
    "# Refine Prompt\n",
    "chat_refine_msgs = [\n",
    "    (\"system\",\"Always answer the question, even if the context isn't helpful.\",),\n",
    "    (\"user\", refine_prompt_str),\n",
    "]\n",
    "refine_template = ChatPromptTemplate.from_messages(chat_refine_msgs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example of Retrivers \n",
    "\n",
    "- Query Engine\n",
    "- Chat Engine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Setting up Query Engine\n",
    "BASE_RAG_QUERY_ENGINE = index.as_query_engine(\n",
    "        similarity_top_k=5,\n",
    "        text_qa_template=text_qa_template,\n",
    "        refine_template=refine_template,)\n",
    "\n",
    "response = BASE_RAG_QUERY_ENGINE.query(\"How many encoders are stacked in the encoder?\")\n",
    "display(Markdown(str(response)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Setting up Chat Engine\n",
    "BASE_RAG_CHAT_ENGINE = index.as_chat_engine()\n",
    "\n",
    "response = BASE_RAG_CHAT_ENGINE.chat(\"How many encoders are stacked in the encoder?\")\n",
    "display(Markdown(str(response)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setup Observability"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install arize-phoenix\n",
    "!pip install openinference-instrumentation-llama-index\n",
    "!pip install -U llama-index-callbacks-arize-phoenix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import phoenix as px\n",
    "\n",
    "# (session := px.launch_app()).view()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from openinference.instrumentation.langchain import LangChainInstrumentor\n",
    "from openinference.instrumentation.llama_index import LlamaIndexInstrumentor\n",
    "from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter\n",
    "from opentelemetry.sdk.trace import SpanLimits, TracerProvider\n",
    "from opentelemetry.sdk.trace.export import SimpleSpanProcessor\n",
    "\n",
    "endpoint = \"http://127.0.0.1:6006/v1/traces\"\n",
    "tracer_provider = TracerProvider(span_limits=SpanLimits(max_attributes=100_000))\n",
    "tracer_provider.add_span_processor(SimpleSpanProcessor(OTLPSpanExporter(endpoint)))\n",
    "\n",
    "LlamaIndexInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)\n",
    "# LangChainInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating Test Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Curating a golden test dataset for evaluation can be a long, tedious, and expensive process that is not pragmatic — especially when starting out or when data sources keep changing. This can be solved by synthetically generating high quality data points, which then can be verified by developers. This can reduce the time and effort in curating test data by 90%."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install ragas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "from phoenix.trace import using_project\n",
    "from ragas.testset.evolutions import multi_context, reasoning, simple\n",
    "from ragas.testset.generator import TestsetGenerator\n",
    "\n",
    "TEST_SIZE = 5\n",
    "CACHE_FILE = \"eval_testset.csv\"\n",
    "\n",
    "def generate_and_save_testset():\n",
    "    # Generator with openai models\n",
    "    generator = TestsetGenerator.with_openai()\n",
    "\n",
    "    # Set question type distribution\n",
    "    distribution = {simple: 0.5, reasoning: 0.25, multi_context: 0.25}\n",
    "\n",
    "    # Generate testset\n",
    "    with using_project(\"ragas-testset\"):\n",
    "        testset = generator.generate_with_llamaindex_docs(\n",
    "            documents, test_size=TEST_SIZE, distributions=distribution\n",
    "        )\n",
    "    \n",
    "    test_df = (\n",
    "        testset.to_pandas()\n",
    "        .sort_values(\"question\")\n",
    "        .drop_duplicates(subset=[\"question\"], keep=\"first\")\n",
    "    )\n",
    "    \n",
    "    # Save the dataset locally\n",
    "    test_df.to_csv(CACHE_FILE, index=False)\n",
    "    print(f\"Test dataset saved to {CACHE_FILE}\")\n",
    "    \n",
    "    return test_df\n",
    "\n",
    "def load_or_generate_testset():\n",
    "    if os.path.exists(CACHE_FILE):\n",
    "        print(f\"Loading existing test dataset from {CACHE_FILE}\")\n",
    "        test_df = pd.read_csv(CACHE_FILE)\n",
    "    else:\n",
    "        print(\"Generating new test dataset...\")\n",
    "        test_df = generate_and_save_testset()\n",
    "    \n",
    "    return test_df\n",
    "\n",
    "# Main execution\n",
    "test_df = load_or_generate_testset()\n",
    "print(test_df.head(5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You are free to change the question type distribution according to your needs. Since we now have our test dataset ready, let’s move on and build a simple RAG pipeline using LlamaIndex."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### RAGAS"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from phoenix.trace.dsl.helpers import SpanQuery\n",
    "\n",
    "client = px.Client()\n",
    "corpus_df = px.Client().query_spans(\n",
    "    SpanQuery().explode(\n",
    "        \"embedding.embeddings\",\n",
    "        text=\"embedding.text\",\n",
    "        vector=\"embedding.vector\",\n",
    "    ),\n",
    "    project_name=\"indexing\",\n",
    ")\n",
    "corpus_df.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from datasets import Dataset\n",
    "from phoenix.trace import using_project\n",
    "from tqdm.auto import tqdm\n",
    "\n",
    "\n",
    "def generate_response(query_engine, question):\n",
    "    response = query_engine.query(question)\n",
    "    return {\n",
    "        \"answer\": response.response,\n",
    "        \"contexts\": [c.node.get_content() for c in response.source_nodes],\n",
    "    }\n",
    "\n",
    "\n",
    "def generate_ragas_dataset(query_engine, test_df):\n",
    "    test_questions = test_df[\"question\"].values\n",
    "    responses = [generate_response(query_engine, q) for q in tqdm(test_questions)]\n",
    "\n",
    "    dataset_dict = {\n",
    "        \"question\": test_questions,\n",
    "        \"answer\": [response[\"answer\"] for response in responses],\n",
    "        \"contexts\": [response[\"contexts\"] for response in responses],\n",
    "        \"ground_truth\": test_df[\"ground_truth\"].values.tolist(),\n",
    "    }\n",
    "    ds = Dataset.from_dict(dataset_dict)\n",
    "    return ds\n",
    "\n",
    "\n",
    "with using_project(\"llama-index\"):\n",
    "    ragas_eval_dataset = generate_ragas_dataset(BASE_RAG_QUERY_ENGINE, test_df)\n",
    "\n",
    "ragas_evals_df = pd.DataFrame(ragas_eval_dataset)\n",
    "ragas_evals_df.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from phoenix.trace.dsl.helpers import SpanQuery\n",
    "# dataset containing embeddings for visualization\n",
    "query_embeddings_df = px.Client().query_spans(\n",
    "    SpanQuery().explode(\"embedding.embeddings\", text=\"embedding.text\", vector=\"embedding.vector\"),\n",
    "    project_name=\"llama-index\",\n",
    ")\n",
    "query_embeddings_df.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from phoenix.session.evaluation import get_qa_with_reference\n",
    "\n",
    "# dataset containing span data for evaluation with Ragas\n",
    "spans_dataframe = get_qa_with_reference(client, project_name=\"llama-index\")\n",
    "spans_dataframe.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ragas uses LangChain to evaluate your LLM application data. Since we initialized the LangChain instrumentation above we can see what's going on under the hood when we evaluate our LLM application."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from phoenix.trace import using_project\n",
    "from ragas import evaluate\n",
    "from ragas.metrics import (\n",
    "    answer_correctness,\n",
    "    context_precision,\n",
    "    context_recall,\n",
    "    faithfulness,\n",
    ")\n",
    "\n",
    "# Log the traces to the project \"ragas-evals\" just to view\n",
    "# how Ragas works under the hood\n",
    "with using_project(\"ragas-evals\"):\n",
    "    evaluation_result = evaluate(\n",
    "        dataset=ragas_eval_dataset,\n",
    "        metrics=[faithfulness, answer_correctness, context_recall, context_precision],\n",
    "    )\n",
    "eval_scores_df = pd.DataFrame(evaluation_result.scores)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assign span ids to your ragas evaluation scores (needed so Phoenix knows where to attach the spans).\n",
    "span_questions = (\n",
    "    spans_dataframe[[\"input\"]]\n",
    "    .sort_values(\"input\")\n",
    "    .drop_duplicates(subset=[\"input\"], keep=\"first\")\n",
    "    .reset_index()\n",
    "    .rename({\"input\": \"question\"}, axis=1)\n",
    ")\n",
    "ragas_evals_df = ragas_evals_df.merge(span_questions, on=\"question\").set_index(\"context.span_id\")\n",
    "test_df = test_df.merge(span_questions, on=\"question\").set_index(\"context.span_id\")\n",
    "eval_data_df = pd.DataFrame(evaluation_result.dataset)\n",
    "eval_data_df = eval_data_df.merge(span_questions, on=\"question\").set_index(\"context.span_id\")\n",
    "eval_scores_df.index = eval_data_df.index\n",
    "\n",
    "query_embeddings_df = (\n",
    "    query_embeddings_df.sort_values(\"text\")\n",
    "    .drop_duplicates(subset=[\"text\"])\n",
    "    .rename({\"text\": \"question\"}, axis=1)\n",
    "    .merge(span_questions, on=\"question\")\n",
    "    .set_index(\"context.span_id\")\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from phoenix.trace import SpanEvaluations\n",
    "\n",
    "# Log the evaluations to Phoenix under the project \"llama-index\"\n",
    "# This will allow you to visualize the scores alongside the spans in the UI\n",
    "for eval_name in eval_scores_df.columns:\n",
    "    evals_df = eval_scores_df[[eval_name]].rename(columns={eval_name: \"score\"})\n",
    "    evals = SpanEvaluations(eval_name, evals_df)\n",
    "    px.Client().log_evaluations(evals)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deep Eval"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install deepeval"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "from deepeval.integrations.llama_index import (\n",
    "    DeepEvalAnswerRelevancyEvaluator,\n",
    "    DeepEvalFaithfulnessEvaluator,\n",
    "    DeepEvalContextualRelevancyEvaluator,\n",
    "    DeepEvalSummarizationEvaluator,\n",
    "    DeepEvalBiasEvaluator,\n",
    "    DeepEvalToxicityEvaluator,\n",
    ")\n",
    "\n",
    "# An example input to your RAG application\n",
    "test_questions = test_df[\"question\"].values\n",
    "for q in tqdm(test_questions):\n",
    "\n",
    "    # LlamaIndex returns a response object that contains\n",
    "    # both the output string and retrieved nodes\n",
    "    response_object = BASE_RAG_QUERY_ENGINE.query(q)\n",
    "\n",
    "    # Create a list of all evaluators\n",
    "    evaluators = [\n",
    "        DeepEvalAnswerRelevancyEvaluator(model=\"gpt-4o-mini\"),\n",
    "        DeepEvalFaithfulnessEvaluator(model=\"gpt-4o-mini\"),\n",
    "        DeepEvalContextualRelevancyEvaluator(model=\"gpt-4o-mini\"),\n",
    "        DeepEvalSummarizationEvaluator(model=\"gpt-4o-mini\"),\n",
    "        DeepEvalBiasEvaluator(model=\"gpt-4o-mini\"),\n",
    "        DeepEvalToxicityEvaluator(model=\"gpt-4o-mini\"),\n",
    "    ]\n",
    "\n",
    "    # Evaluate the response using all evaluators\n",
    "    for evaluator in evaluators:\n",
    "        evaluation_result = evaluator.evaluate_response(\n",
    "            query=q, response=response_object\n",
    "        )\n",
    "        print(f\"{evaluator.__class__.__name__} Result:\")\n",
    "        print(evaluation_result)\n",
    "        print(\"\\n\" + \"=\"*50 + \"\\n\") "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llm-venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
